By Cao Yueran
Previous studies on years of lives lost have shown that in general, the lives of the citizens of more developed countries are cut short due to degenerative diseases such as stroke and heart disease, while for developing countries, infectious disease still remains a leading cause of death. However, previous measurements of "development" use GDP per capita or qualitative analysis which does not take into account the inequality of a country. This projects attempts to find a correlation between the causes of mortality and two indexes that are relatively resistant to inequality, BMI and CMR.
Specific causes of mortality is generally a very difficult type of census data to gather, especially for less financially capable countries. These countries also tend to be the ones most in need of accurate data as to why their citizens are dying young. If BMI and CMR, two widely availabe indexes could accurately indicate a country's developmental status in regards to healthcare (as reflected in causes of mortality), then countries would be able to use them as a basis for targeted healthcare policies instead of causes of mortality.
Mortality data is acquired from https://apps.who.int/healthinfo/statistics/mortality/whodpms/. The website is very difficult for a machine to navigate, so 16 .xml files were chosen and downloaded manually, each pertaining to a general cause of death. The .xml files were subsequently converted to .csv files for easy import. Each .xml contains Age-Standardized Death Rates (ASDRs) by country, by year (1979-2016).
The WHO world standard population were used to calibrate the mortality data. The ASDRs were calculated by WHO, so it is using the same formula as the one used in the BMI dataset (also by WHO). This should make the results a bit more indicative.
The WHO BMI dataset contains 25000 datapoints of the mean male, female and combined BMI of each country, per year (1975-2016). I am only interested in the combined BMI, for the ASDR are combined as well. Within each data cell, it also has the uncertainty bounds, but I will only be using the mean BMI.
The UNICEF child mortality dataset contains 20000 datapoints of median child mortality rate by country, by year (1950-2019). It contains both sexes, male and female statistics, but once again we are only interested in both sexes.
HDI indexes to determine a country's state of development is taken from http://hdr.undp.org/sites/default/files/hdro_statistical_data_tables_1_15_d1_d5.xlsxv
The index for diseases is available locally.
import numpy as np, pandas as pd
import requests
from matplotlib import pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
pd.options.display.max_rows = 20
pd.options.mode.chained_assignment = None
disease_codes = pd.read_csv("disease_codes.csv", dtype={"disease": str})
disease_codes.index = np.arange(1, len(disease_codes) + 1)
disease_codes
16 .xml files containing ASDRs will now be imported and merged into one DataFrame.
forConcat = []
for x in range(1, len(disease_codes) + 1):
temp = pd.read_csv(str(x) + ".csv", header=1)
temp.insert(1, "Cause", x)
forConcat.append(temp)
mdf = pd.concat(forConcat, ignore_index=True)
mdf = mdf.loc[mdf.Countries != "Total reporting countries"]
mdf.rename({"Countries": "Country"}, axis=1, inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf.replace(" ", np.nan, inplace=True)
mdf
Let us proceed to map the names of causes onto Cause.
mdf.insert(
2, "CauseName", mdf.Cause.map(dict(zip(disease_codes.index, disease_codes.name)))
)
mdf.sort_values(by=["Country", "Cause"], inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf
Let us remove the countries with no datapoints from 1979 until 2016. They are of no use to our analysis. The other missing data points would be left in for now, as it may come in handy during the plotting of a choropleth map to show which countries' infrastructure are so bad that they have no mortality data.
missingPerCountry = (
mdf.groupby("Country")
.apply(lambda x: x.notna().sum())
.sum(axis=1)
.sort_values(ascending=False)
)
noDataCountries = list(missingPerCountry[missingPerCountry <= 48].index)
mdf.drop(mdf[mdf.Country.isin(noDataCountries)].index, inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf
Reverse the year numbers for easier graph plotting.
mdf = mdf[
["Country", "Cause", "CauseName"] + list(mdf.columns[:2:-1])
] # I don't get why this is [:0:-1] instead of [1::-1] but ok.
mdf
Let us assign ISO codes for each country and remove the locations that are not countries. We will also remove the rows with not a single data point. I chose to use get instead of search_fuzzy because the latter is way too slow.
I will also map the codes back onto the name to remove any inconsistencies as pycountry is not a 1-to-1 mapping.
import pycountry
def get_country_code(name):
try:
return pycountry.countries.lookup(name).alpha_3
except:
return None
def get_country_name(code):
try:
return pycountry.countries.get(alpha_3=code).name
except:
return None
mdf.insert(0, "CountryCode", mdf.Country.apply(get_country_code))
mdf2 = mdf.copy()
mdf.drop(mdf[mdf.CountryCode.isna()].index, inplace=True)
mdf.dropna(thresh=5, inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf[mdf.CountryCode == "MDK"]
We have to check that the countries left out are actually not countries, not just incorrectly named. If they are actually countries, then we should rename them and concatenate them back into mdf.
leftOut = mdf2[mdf2.CountryCode.isna()]
display(leftOut.Country.value_counts())
countryMapping = {
"Hong Kong SAR": "Hong Kong",
"Macau": "Macao",
"Reunion": "Réunion",
"Saint Vincent and Grenadines": "Saint Vincent and the Grenadines",
"Virgin Islands (USA)": "Virgin Islands, U.S.",
"TFYR Macedonia": "North Macedonia",
"Iran (Islamic Rep of)": "Iran, Islamic Republic of",
"Republic of Korea": "Korea, Republic of",
"Venezuela (Bolivarian Republic of)": "Venezuela, Bolivarian Republic of",
}
leftOut.replace(countryMapping, inplace=True)
leftOut.drop("CountryCode", inplace=True, axis=1)
leftOut.insert(0, "CountryCode", leftOut.Country.apply(get_country_code))
print("number of nan: " + str(leftOut.CountryCode.isna().sum()))
mdf = pd.concat([mdf, leftOut], ignore_index=True)
mdf
mdf.Country = mdf.CountryCode.apply(get_country_name)
mdf.CountryCode.isna().sum()
bmidf = pd.read_csv("bmi.csv")
bmidf.drop([0, 1], inplace=True)
bmidf.rename({"Unnamed: 0": "Country"}, axis=1, inplace=True)
bmidf.reset_index(inplace=True, drop=True)
criteria = (bmidf.loc[0] == " Both sexes") | (bmidf.loc[0] == "Country")
bmidf = bmidf[
criteria.index[criteria]
] # This is a really good way to boolean index columns
bmidf
As one can see, bmidf only contains the BMI values for both sexes, so we can remove that row. We can also remove the uncertainty bounds, keeping only the first number which indicates the mean BMI.
bmidf.drop(0, inplace=True)
bmidf.reset_index(inplace=True, drop=True)
bmidf
bmicountries = bmidf.loc[:, "Country"]
def check_float(f):
try:
float(f)
return True
except ValueError:
return False
bmidf = bmidf.loc[:, "2016":].applymap(
lambda x: float(x.split(" ")[0]) if check_float(x.split(" ")[0]) else None
)
bmidf.insert(0, "Country", bmicountries)
bmidf
Similar to mortality data, BMI data also has some country names inconsistent with the ISO standard. We should be able to fix it using the same map as the one before.
bmidf.replace(countryMapping, inplace=True)
bmidf.insert(0, "CountryCode", bmidf.Country.apply(get_country_code))
bmidf.dropna(inplace=True)
bmidf.Country = bmidf.CountryCode.apply(get_country_name)
display(bmidf.CountryCode.isna().sum())
display(bmidf)
Also reverse the year numbers to make plotting graphs easier.
bmidf = bmidf[["CountryCode", "Country"] + list(bmidf.columns[:1:-1])]
bmidf
Remove data prior to 1979:
bmidf.drop([str(x) for x in range(1975, 1979)], axis=1, inplace=True)
bmidf.reset_index(inplace=True, drop=True)
bmidf
Check for missing data:
display(bmidf.dtypes.value_counts())
display(bmidf.isna().sum().sum())
cmrdf = pd.read_excel("cmr.xlsx", sheet_name=0, header=14)
cmrdf = cmrdf.loc[:584]
cmrdf
Note that we only need the Median rows of CMR. I will also standardize the column names.
cmrdf = cmrdf[cmrdf["Uncertainty.Bounds*"] == "Median"]
cmrdf.reset_index(inplace=True, drop=True)
cmrdf.drop("Uncertainty.Bounds*", axis=1, inplace=True)
cmrdf.rename(
{"ISO.Code": "CountryCode", "Country.Name": "Country"}, axis=1, inplace=True
)
cmrdf
Let us standardize the year naming scheme.
cmrdf.columns = cmrdf.columns.str.replace("\.5", "")
cmrdf
display(
cmrdf.isna()
.sum()
.plot(kind="bar", figsize=(15, 10), xlabel="Column", ylabel="Number of NaN")
)
The NaN values seem to be concentrated towards the datapoints taken earlier, mostly before the 1980s. Data before 1979 is useless anyways, so there shouldn't be too much impact leaving a few nans in.
Let us remove the data prior to 1979 and after 2016.
cmrdf.drop(
[str(x) for x in list(range(1950, 1979)) + list(range(2017, 2020))],
axis=1,
inplace=True,
)
cmrdf
cmrdf suffers from the same problem of iso-uncompliant country names, so let's fix that right now.
cmrdf.Country = cmrdf.CountryCode.apply(get_country_name)
display(cmrdf)
df = pd.read_excel("hdi.xlsx", sheet_name=0)
with pd.option_context("display.max_rows", None):
display(df)
Why doesn't a single database use standardised country codes?????? This is the same thing as before, only this time I was too lazy to type out the entire mapping so I just used pycountry's fuzzy_search instead. The country codes and names have been standardised, so it should be the same as that of the dfs before it.
hdidf = df.iloc[7:199, [1, 2]].reset_index(drop=True)
hdidf.columns = ["Country", "HDI"]
hdidf.Country.replace(
{"Congo (Democratic Republic of the)": "Congo, the Democratic Republic of the"},
inplace=True,
)
hdidf.insert(0, "CountryCode", hdidf.Country.apply(get_country_code))
not_matched = hdidf[hdidf.CountryCode.isna()]
hdidf = hdidf[~hdidf.CountryCode.isna()]
display(hdidf)
not_matched = not_matched[~not_matched.Country.str.contains("DEVELOPMENT")]
not_matched.Country = not_matched.Country.str.split(",|\(", expand=True)[0]
not_matched.CountryCode = not_matched.Country.apply(
lambda x: pycountry.countries.search_fuzzy(x)[0].alpha_3
)
hdidf = pd.concat([hdidf, not_matched])
hdidf.Country = hdidf.CountryCode.apply(get_country_name)
hdidf.sort_values(by="CountryCode", inplace=True)
hdidf.reset_index(inplace=True, drop=True)
display(not_matched)
display(hdidf)
Data cleaning is now complete, and the datasets are as follows:
display(HTML("<h5>Mortality Causes DataFrame: </h5>"))
display(mdf)
display(HTML("<h5>BMI DataFrame: </h5>"))
display(bmidf)
display(HTML("<h5>CMR DataFrame: </h5>"))
display(cmrdf)
display(HTML("<h5>HDI DataFrame: </h5>"))
display(hdidf)
Let us create some long versions of the DataFrames as well.
longmdf = pd.melt(
mdf,
id_vars=["CountryCode", "Country", "Cause", "CauseName"],
var_name="Year",
value_name="DeathsPer100k",
)
longbmidf = pd.melt(
bmidf, id_vars=["CountryCode", "Country"], var_name="Year", value_name="BMI"
)
longcmrdf = pd.melt(
cmrdf, id_vars=["CountryCode", "Country"], var_name="Year", value_name="CMR"
)
display(longmdf)
display(longbmidf)
display(longcmrdf)
In general, the leading cause of mortality has shifted from infectious and parasitic diseases to diseases of the circulatory system to neoplasms as time progresses.
Circulatory system illness remain by far the most common cause of death with neoplasms trailing behind. The other causes are far below these two.
Countries that are outliers when it comes to causes of death generally remain outliers for at least a decade. This shows that in order to be an outlier, there must be something habitual that these countries cannot fix quickly.
Let us find the proportion of deaths caused by each Cause globally. Note that the total number of deaths per year changes, so we need to divide the deaths per 100,000 per Cause by total deaths per 100,000.
deathsPerYearByCause = mdf.groupby("CauseName").mean()
display(deathsPerYearByCause)
display(deathsPerYearByCause.sum())
dpybcNormal = (deathsPerYearByCause / deathsPerYearByCause.sum()).drop("Cause", axis=1)
dpybcNormal.T.plot(kind="line", figsize=(20, 10), colormap='gist_rainbow').legend(bbox_to_anchor=(1, 1))
dpybcNormal.T.plot(kind="line", figsize=(20, 10), colormap='gist_rainbow', ylim=(0, 0.1)).legend(
bbox_to_anchor=(1, 1)
)
The cause with the highest fraction of deaths by far is that of circulatory diseases. This has been decreasing since the 1990s but have recently risen.
The second is neoplasms, which is essentially cancer. This has been gradually increasing since the 1980s, but have recently spiked.
At around 1984, cardiovascular diseases suddenly took a dip while then number 2, 4, 5 (neoplasms, digestive diseases, infectious diseases) spiked.
The third is respiratory diseases. Respiratory diseases in general have been decreasing as the world becomes more developed.
Endocrine, nutritional and metabolic diseases have been on the rise quite rapidly since 1994.
Let us make an animated map of cardiovascular disease deaths across time. The data has to be interpolated based on last recorded number of deaths. This would prevent the countries' color from snapping in between an actual color and gray.
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
mdf.loc[:, "1979":"2016"].apply(pd.to_numeric, errors="coerce")
mdf["1979"] = mdf["1979"].astype(float)
display(mdf)
# with pd.option_context('display.max_rows', None):
mdfInterp = pd.concat(
[mdf.iloc[:, :4], mdf.loc[:, "1979":"2016"].interpolate(method="pad", axis=1)],
axis=1,
)
display(mdfInterp)
We will now convert the DataFrame to a "long" format from the current "wide" format by using the method melt. This prepares the data for the choropleth visualization.
forMap = pd.melt(
mdfInterp, id_vars=["CountryCode", "Country", "Cause", "CauseName"], var_name="Year"
)
forMap.reset_index(inplace=True, drop=True)
forMap.rename(columns={"value": "DeathsPer100k"}, inplace=True)
forMap.DeathsPer100k = forMap.DeathsPer100k.astype(float)
display(forMap)
We will now divide DeathsPer100k by the total number of deaths in that country, in that year. This is because different countries have different total rates of death, and this question aims to analyse the proportion of deaths due to a particular cause globally across the years.
totalDeathsSeries = forMap.groupby(["CountryCode", "Year"]).DeathsPer100k.sum()
def calculate_death_prop(row):
if row.DeathsPer100k != np.nan:
totalDeaths = totalDeathsSeries.loc[(row.CountryCode, row.Year)]
row.DeathsPer100k = row.DeathsPer100k / totalDeaths
return row
display(forMap)
forMapCalibed = forMap.apply(
calculate_death_prop, axis=1
) # re-calibrated DataFrame for map creation
forMapCalibed.rename(columns={"DeathsPer100k": "DeathsProp"}, inplace=True)
display(forMapCalibed)
We will now normalize the data using the min-max method.
forMapCalibed.DeathsProp /= (
forMapCalibed.DeathsProp.max() - forMapCalibed.DeathsProp.min()
)
forMapCalibed
As one can see, the DeathsPer100k have been successfully converted to DeathsProp and the normalization has also done its job. For instance, the deaths figure for CountryCode=VIR and Cause=14 has been converted from 8.3 originally to 0.027 after calibration to 0.033 after normalization. Now we can make our animated choropleth maps.
Note: Please see individual .html files for the results.
for cause in range(1, 17):
forMapWithCause = forMapCalibed[forMapCalibed.Cause == cause]
fig = px.choropleth(
forMapWithCause, # Input Dataframe
locations="CountryCode", # identify country code column
color="DeathsProp", # identify representing column
hover_name="Country", # identify hover name
animation_frame="Year", # identify date column
projection="natural earth", # select projection
color_continuous_scale="Viridis", # select preferred color scale
range_color=[0, forMapWithCause.DeathsProp.max()], # select range of dataset
title=str(cause) + ". " + disease_codes.loc[cause, "name"],
)
fig.write_html("q1choropleth" + format(cause, "02d") + ".html", full_html=False)
I will go over some graphs that I find to be interesting.
Graph 9, diseases of the circulatory system: You can see a clear trend of the American countries gradually having circulatory system diseases become a smaller part of their causes of death. Most of Europe is exhibiting the same trend as well, with Eastern Bloc nations lagging behind the First World by about 30 years. Kazakhstan is the exception here, as it has reduced its circulatory disease deaths proportion significant in recent years. Egypt is roughly following the same trend as the Eastern Bloc.
Surprisingly, Mexico and Guatemala has since 1979 shown a very low proportion of deaths due to this cause. This breaks the trend of more developed countries having a lower proportion.
Graph 2, neoplasms: Compared to graph 9, the trend here is much less clear. In recent years, many developed countries have started to have a high proportion of this cause of death. In contrast to cardiovascular diseases, the countries' colors also tend to fluctuate a bit more between light and dark, indicating that the trend is much less obvious.
Graph 5, mental and behavioural disorders: First World countries have been experiencing a higher number of deaths due to this cause in the past 20 years, starting with Finland. I think this may be due to mental health becoming better documented in developed countries. Moreover, in many developing countries, parents would discreetly kill their child if they perceive something to be wrong with their minds early on, which makes this statistic a tad unreliable.
Graph 11, digestive system: Mexico and Guatemala has consistently shown a higher proportion of deaths due to this cause. Egypt and Eastern Bloc countries have also been especially vulnerable in recent years.
Graph 10, respiratory system: Guatemala is exceptionally high here. Most of South America as well as Kazakhstan have seen increases in this cause in the past few years.
Now we shall plot the leading cause of death globally over time. This will be similar to the choropleth plotted above but this time the data is discrete rather than continous. We have to first make a DataFrame of leading cause of death and number of deaths (for easy reference) per year by country.
Note from the future Steve:
Plotly is open-source so expect bugs. The following chunk of code creates placeholder datapoints. Without them, the only shown leading causes of deaths would the ones during the first year (1979). In the following years, if another leading cause of death appears, the choropleth map will simply not show it. This persists until the original leading causes are no longer present in the current year.
This is clearly a bug on their part, so I will take the liberty of using some comparatively inefficient for loops to sort out the issue.
leadingCauses = list(
pd.DataFrame(
forMap.replace(np.nan, 0)
.groupby(["Country", "Year"])
.apply(
lambda x: x.rename(
{"Country": "ColumnCountry", "Year": "ColumnYear"}, axis=1
).loc[x.DeathsPer100k.idxmax()]
)
).Cause.unique()
)
vat = {
"CountryCode": [],
"Country": [],
"Cause": [],
"CauseName": [],
"Year": [],
"DeathsPer100k": [],
}
for x in range(1979, 2017):
for y in leadingCauses:
vat["CountryCode"].append("ph" + str(y))
vat["Country"].append("ph" + str(y))
vat["Cause"].append(y)
vat["CauseName"].append(disease_codes.loc[y, "name"])
vat["Year"].append(str(x))
vat["DeathsPer100k"].append(0)
vatdf = pd.DataFrame(vat)
forDMap = pd.concat([forMap, vatdf], ignore_index=True)
display(forDMap)
maxCauseOfDeath = pd.DataFrame(
forDMap.replace(np.nan, 0)
.groupby(["Country", "Year"])
.apply(
lambda x: x.rename(
{"Country": "ColumnCountry", "Year": "ColumnYear"}, axis=1
).loc[x.DeathsPer100k.idxmax()]
)
)
Now, we can plot a discrete choropleth map.
maxCauseOfDeath.ColumnYear = maxCauseOfDeath.ColumnYear.astype(int)
maxCauseOfDeath.Cause = maxCauseOfDeath.Cause.astype(str)
fig = px.choropleth(
maxCauseOfDeath,
color="CauseName",
locations="CountryCode",
hover_data=["ColumnCountry", "DeathsPer100k"],
animation_frame="ColumnYear",
)
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.9))
fig.write_html("q1discretechoro.html")
With the exception of Guatemala, which has consistently had respiratory illnesses as it's primary cause of death, the rest of the nations experience the causes of death roughly in this order as they become more developed: infectious, circulatory and then neoplasms.
I believe the reasons are as follows. Infectious diseases used to be commonplace as the healthcare infrastructure of many Second and Third World countries were unsatisfactory. Once this problem has been resolved, the two remaining are both degenerative diseases. Circulatory diseases are partially due to lifestyle decisions such as diet and exercise, and so as the citizens of very developed countries educate themselves in these regards, proportion of deaths due to circulatory diseases have decreased. Lastly, there is still no cure for cancer, so neoplasms is the final major cause of death that even exceptionally developed countries are struggling with today.
Let us also plot some boxplots to identify the outlier countries of each cause of death each year.
topCauses = (
mdf.groupby("Cause").sum().sum(axis=1).sort_values(ascending=False).index[:5]
)
forBoxplot = longmdf.copy()
forBoxplot.reset_index(inplace=True, drop=True)
forBoxplot.DeathsPer100k = forBoxplot.DeathsPer100k.astype(float)
# This is similar to the calibrated data used to create choropleth maps except there is no interpolation.
totalDeathsSeries = forBoxplot.groupby(["CountryCode", "Year"]).DeathsPer100k.sum()
display(forBoxplot)
forBoxplotCalibed = forBoxplot.apply(
calculate_death_prop, axis=1
) # re-calibrated DataFrame for Boxplot creation
forBoxplotCalibed.rename(columns={"DeathsPer100k": "DeathsProp"}, inplace=True)
forBoxplotCalibed.DeathsProp /= (
forBoxplotCalibed.DeathsProp.max() - forBoxplotCalibed.DeathsProp.min()
)
display(forBoxplotCalibed)
Now, onto plotting the boxplots. I will be using a whisker size of 2 IQR instead of the conventional 1.5 IQR because this dataset has high kurtosis, and too many outliers makes the plot very cluttered. I will also be labelling the outliers so as to get a clear indication of which nations have been outlying throughout the years.
fig, axs = plt.subplots(len(topCauses), figsize=(20, 100))
forBoxplotCondensed = forBoxplotCalibed[forBoxplotCalibed.Cause.isin(topCauses)]
for x in range(len(topCauses)):
# Label outliers:
# dptc stands for DeathsPerTopCause
dptc = forBoxplotCondensed[forBoxplotCondensed.Cause == topCauses[x]]
q1 = dptc.groupby(dptc.Year).quantile(0.25)["DeathsProp"].to_numpy()
q3 = dptc.groupby(dptc.Year).quantile(0.75)["DeathsProp"].to_numpy()
outlier_top_lim = q3 + 2 * (q3 - q1)
outlier_bottom_lim = q1 - 2 * (q3 - q1)
# print(outlier_top_lim)
for row in dptc.itertuples():
year = int(row.Year) - 1979
# print(type(year), year)
val = row.DeathsProp
if val > outlier_top_lim[year] or val < outlier_bottom_lim[year]:
axs[x].text(year + 0.1, val, row.CountryCode, ha="left", va="center")
sns.boxplot(data=dptc, y="DeathsProp", x="Year", ax=axs[x], whis=2)
axs[x].set_title(disease_codes.loc[topCauses[x], "name"])
axs[x].set_ylabel("Fraction of Deaths /1")
Circulatory: Guatemala had notably lower proportion of deaths due to this cause prior to 1991.
Neoplasms: Quite a large IQR with few outliers, indicating that no matter how developed a country is, cancer will still take away roughly the same proportion of lives. However, it has been increasing in recent years, as evidenced by the choropleths as well.
Respiratory: Before 2006, Guatemala had exceptionally high proportion of deaths due to this cause. After 2006, Singapore took its position and has maintained this outlier status for the past 10 years. I think this may be due to the high number of smokers after Singapore's GDP per capita rose and people could afford cigarettes.
Digestive: Egypt, North Macedonia and Mexico have all had high proportions of deaths due to this cause, with the latter 2's outlier status persevering from the 1980s until now. Kiribati also experienced a spike in this statistic in the 1990s. Perhaps it is due to the ulcer-inducing spicy food found in the borders of these countries.
Infectious and Parasitic: Guatemala and Kiribati have had notably higher proportions of death due to this cause from the 1980s to the 1990s. In recent years, Thailand and South Africa have dominated the charts, both being significantly outside of the range of 2 IQR. These countries have quite a bit of inequality (e.g. South Africa had apartheid) when it comes to urbanisation, so that may be why.
As BMI increases, the death rate for most causes of death decreases. This is probably because BMI only increases when a country becomes more developed and people can eat more nutritious meals.
The only exceptions are nervous system and behavioural disorders, which are positively correlated with BMI. I will give further explanation below.
On a country level, Russia appears to be an exception as it's number of deaths due to digestive system illnesses is also positively correlated with BMI.
Correlation and Regplots
Let us find out which causes of death are most correlated with BMI through correlation values and regplots.
bmiMort = pd.merge(
longmdf, longbmidf, on=["CountryCode", "Country", "Year"], how="inner"
)
_bmiMort = pd.merge(longmdf, longbmidf, on=["CountryCode", "Year"], how="inner")
bmiMort.DeathsPer100k = bmiMort.DeathsPer100k.astype(float)
display(bmiMort)
display(_bmiMort)
Notice how the number of rows is the same whether Country is include in the on parameter. This shows that the mapping between Country and CountryCode is 1-to-1.
We shall now proceed to plot regplots of every countries data throughout the years. We will be removing the outliers (defined arbitrarily based on quantile). If you decrease the quantile range, you would see that the number of data points remaining is decreased, which shows that the code is correct.
from scipy import stats
import math
def find_correl(df, a, b):
dfnona = df.dropna()
return stats.pearsonr(dfnona[a], dfnona[b])
# append correlation between two variables grouped by Cause
def append_correl(df, a1, b1):
correls = df.groupby("Cause").apply(find_correl, a=a1, b=b1)
diseases = df.Cause.unique()
correldf = pd.DataFrame(
[[a, correls[a][0], correls[a][1]] for a in diseases],
columns=["Cause", "Correl", "pValue"],
)
return pd.merge(df, correldf, on="Cause")
def remove_outliers(df):
outlier_boundaries = (
df.groupby("Cause").DeathsPer100k.quantile([0.05, 0.95]).unstack(level=1)
)
nooutlier = df[
(
df.DeathsPer100k
>= outlier_boundaries.iloc[df.Cause - 1, 0].reset_index(drop=True)
)
& (
df.DeathsPer100k
<= outlier_boundaries.iloc[df.Cause - 1, 1].reset_index(drop=True)
)
]
return nooutlier
# remove outliers + append_correl
display(bmiMort)
bmiMort_nooutlier = remove_outliers(bmiMort)
display(bmiMort_nooutlier)
bmiMort2 = append_correl(bmiMort_nooutlier.dropna(), "BMI", "DeathsPer100k")
display(bmiMort2)
def plot4x4reg(df, indepvar):
# To make it work for other number of causes as well
count = len(disease_codes)
maxwidth = math.ceil(count**0.5)
fig, axes = plt.subplots(maxwidth, maxwidth, figsize=(30, 20))
diseases_sorted = df.sort_values(by="Correl").CauseName.unique()
for ax, cause in zip(axes.ravel(), diseases_sorted):
dfs = df.query("CauseName==@cause")
sns.regplot(
data=dfs,
x=indepvar,
y="DeathsPer100k",
scatter_kws={"s": 3},
x_jitter=0.05,
ax=ax,
).set_title(
cause + " (" + str(round(dfs.Correl.iloc[0], 2)) + ", p="
# round p value to 3 d.p as anything less than 0.001 is statistically significant
+ str(round(dfs.pValue.iloc[0], 4)) + ")",
fontsize=8,
)
plot4x4reg(
append_correl(bmiMort.dropna(), "BMI", "DeathsPer100k"), "BMI"
) # with outliers
Now, without outliers the regplots are much tighter along the y-axis.
plot4x4reg(bmiMort2, "BMI")
As one can see, there is a huge amount of variability in the data (barring the jitter set in place to compensate for decimal point issues) so there is no strong correlation between death originating from major causes and BMI. However, what if we take the average of DeathsPer100k for every country per year. This would theoretically fix the massive amount of variation between the developed and developing countries at any point in time.
def get_mean_deaths(df, indepvar):
meanDeaths = df.groupby(["Year", "Cause", "CauseName"]).agg(
{"DeathsPer100k": ["mean"], indepvar: ["mean"]}
)
meanDeaths.reset_index(inplace=True)
meanDeaths.columns = meanDeaths.columns.droplevel(1)
return meanDeaths
def reg_mean_deaths(df, indepvar):
meanDeaths = get_mean_deaths(df, indepvar)
meanDeaths = append_correl(meanDeaths, indepvar, "DeathsPer100k")
display(meanDeaths)
sns.color_palette("bright")
sns.lmplot(
data=meanDeaths,
x=indepvar,
y="DeathsPer100k",
hue="CauseName",
scatter_kws={"s": 5},
height=8,
aspect=1.5,
legend=False,
)
legend = (
meanDeaths.CauseName
+ " ("
+ meanDeaths.Correl.round(2).astype(str)
+ ", p="
+ meanDeaths.pValue.round(4).astype(str)
+ ")"
).unique()
plt.legend(title="Cause", bbox_to_anchor=(1.04, 1), loc="upper left", labels=legend)
plot4x4reg(meanDeaths, indepvar)
display(bmiMort2)
reg_mean_deaths(bmiMort2, "BMI")
As one can see, the gradient of the regression lines (i.e. whether they are increasing or decreasing) are in line with the correlation values in the legend. This shows that this plot is correct.
Analysis: One would expect that as BMI increases, deaths due to cardiovascular reasons would increase. However this is not the case. There is a very strong negative correlation (c=-0.93) between BMI and circulatory system diseases. The same is observed for many other causes of death. I think this is because a country with high BMI is a country that can cater to the nutritional needs of its citizens very well, so it is developed and healthcare standards should be high as well, and deaths due to various illnesses decrease. The only cause that shows a strong positive correlation with BMI is mental and behavioural disorders. Perhaps this is due to the same reason as the choropleth maps above. I.e., in developed countries, people with mental illnesses actually survive long enough to be considered a valid data point.
Let us now try to narrow it down to any particular country. We will keep the outliers in for this one as while the deaths figures may be outliers on the global scale, on a country-by-country basis they could still be statistically significant. NOTE: I will only be drawing conclusions from "important" countries such as the United States, but this model can be used for any country with data.
def reg_per_country(df, indepvar):
inp = ""
while inp.lower() != "stop":
inp = input("Enter alpha-3 country code: ").upper()
if inp in df.CountryCode.values:
country = df[df.CountryCode == inp]
print(country.Country.iloc[0])
# Plot only causes that have a variety of number of deaths.
gbCause = country.groupby("Cause")
largeDiff = gbCause.DeathsPer100k.max() - gbCause.DeathsPer100k.min() > 10
country = country[largeDiff[country.Cause].values]
country.drop(
["Correl", "pValue"], axis=1, inplace=True
) # recalculate correlation
country = append_correl(country, indepvar, "DeathsPer100k")
toPlot = country[
# Ensure correlation is strong and valid
(abs(country.Correl) >= 0.5)
& (country.pValue <= 0.05)
]
# Sort by sum of deaths to make the line colors a bit more in sync for easier cross-referencing.
longmdf.DeathsPer100k = longmdf.DeathsPer100k.astype(float)
sort_by_this = (
longmdf.groupby("Cause").DeathsPer100k.sum().sort_values().index
)
toPlot.Cause = toPlot.Cause.astype("category")
toPlot.Cause.cat.set_categories(sort_by_this, inplace=True)
toPlot.sort_values(["Cause"], inplace=True, ascending=False)
display(toPlot)
plot = sns.lmplot(
data=toPlot,
x=indepvar,
y="DeathsPer100k",
hue="CauseName",
scatter_kws={"s": 10},
height=10,
aspect=1.5,
legend=False,
)
legend = (
toPlot.CauseName
+ " ("
+ toPlot.Correl.round(2).astype(str)
+ ", p="
+ toPlot.pValue.round(4).astype(str)
+ ")"
).unique()
plt.legend(title="Cause", loc="upper right", labels=legend)
plt.show(plot)
reg_per_country(append_correl(bmiMort, "BMI", "DeathsPer100k").dropna(), "BMI")
United States: Nervous system illnesses and mental disorders are related, and they are both positively correlated with BMI. The only explanation I can think of is the one previously mentioned. U.S.'s data is in accordance with the rest of the world.
Russia: The main thing to note here is the strong positive correlation between digestive system diseases and BMI. Perhaps the Russians' diet is quite unhealthy, so as they consume more nutrition, their risk of digestive system illness also increase.
United Kingdom: Similar to U.S., except there is also a negative correlation bewtween respiratory system illnesses and BMI. Perhaps as the U.K. became more developed, people gained the awareness to smoke less and environmental standards went up as well, reducing the number of deaths due to this cause.
Brazil: Nothing out of the ordinary. Developing countries are not that different from developed countries in terms of this correlation.
As CMR increases, the death rate for most causes of death increases. This is probably because CMR is only high when a country has poor health infrastructure.
The only exceptions are nervous system and behavioural disorders, which are negatively correlated with CMR.
On a country level, Russia appears to be an exception as it's number of deaths due to digestive system illnesses is also negatively correlated with CMR.
cmrMort = pd.merge(
longmdf, longcmrdf, on=["CountryCode", "Country", "Year"], how="inner"
)
_cmrMort = pd.merge(longmdf, longcmrdf, on=["CountryCode", "Year"], how="inner")
cmrMort.DeathsPer100k = cmrMort.DeathsPer100k.astype(float)
display(cmrMort)
display(_cmrMort)
As per BMI, cmrdf's Country column is also in accordance with that of mdf. Now, let's plot some regplots but without outliers.
cmrMort2 = append_correl(remove_outliers(cmrMort).dropna(), "CMR", "DeathsPer100k")
display(cmrMort2)
plot4x4reg(cmrMort2, "CMR")
Same as BMI, correlation on a global scale has too much variance to be statistically significant. The cause with the highest correlation is related to childbirth, which makes sense as a high number of the deaths occurring due to childbirth can be classified as "child mortality".
reg_mean_deaths(cmrMort2, "CMR")
The correlations are almost the exact opposite of that of BMI. Diseases of the nervous system and mental disorders are positively correlated with CMR, once again proving my hypothesis that many people with undiagnosed mental disorders are dying young, so they will not be tallied as a death due to those causes.
reg_per_country(append_correl(cmrMort, "CMR", "DeathsPer100k").dropna(), "CMR")
Almost the reverse correlations as BMI. Once again, Russia is the anomaly here, with deaths due to digestive diseases correlating negatively with CMR. I cannot explain this with a direct reason. Perhaps it is due to the same reason as BMI, where as Russia gets more developed, its citizens eat unhealthily enough to develop digestive disorders.
The only cause of death that could be predicted to some extent from these two statistics is illnesses of the circulatory system. The model that has been created works best when the country is in the transition phase between developing and developed.
We will use the data for every developed country (HDI > 0.9) as over the course of 38 years, most of them have went through several stages of development and can be used as a predictor for the rest of the world in the future. This would be free from the variance (perhaps due to war and instabilities) that's present in developing countries. Feel free to adjust the HDI threshold to something other than 0.9, but from my experience doing so, the resulting model is really all over the place. The scatterplots look to be randomly generated in that case. The failed graphs have been omitted for the sake of saliency.
I will not be taking the mean of all countries throughout the years as that produces too few data points and is prone to overfitting.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
altogether = pd.merge(
bmiMort2.drop(['Correl', 'pValue'], axis=1),
cmrMort2[["CountryCode", "Cause", "Year", "CMR"]],
on=["CountryCode", "Cause", "Year"],
)
# to show that there is an hdi value for every country code in altogether
# notice number of rows is the same
display(len(altogether))
altogether = pd.merge(altogether, hdidf, on=["CountryCode", "Country"], how="left")
display(len(altogether))
# feel free to uncomment these lines for mean deaths per year,
# but there are too few data points, leading to underfitting
# when it comes to a country by country basis.
# it will not be plotted for the sake of saliency.
# for_mlm = pd.concat([get_mean_deaths(altogether,'BMI'),
# get_mean_deaths(altogether,'CMR')['CMR']],
# axis=1)
for_mlm = altogether[altogether.HDI > 0.9]
for_mlm
With the combined dataset manufactured, we can now proceed to create a multiple linear regression model using BMI and CMR to predict any cause of death we so choose. This data is free from outliers. distplots will be used to determine roughly how good a model they are.
def build_model(cause):
mlm = LinearRegression()
by_cause = for_mlm.query("Cause==@cause")
# normalize
by_cause.DeathsPer100k /= (
by_cause.DeathsPer100k.max() - by_cause.DeathsPer100k.min()
)
x_train, x_test, y_train, y_test = train_test_split(
by_cause[["CMR", "BMI"]],
by_cause["DeathsPer100k"],
test_size=0.2,
random_state=4132,
)
mlm.fit(x_train, y_train)
return (mlm.intercept_, mlm.coef_, mlm.predict(x_test), y_test)
fig, axes = plt.subplots(len(disease_codes), 1, figsize=(20, 80))
fig.tight_layout(pad=3.0)
models = []
for cause in range(1, len(disease_codes) + 1):
model = build_model(cause)
name = disease_codes.loc[cause, "name"]
models.append([name, mean_squared_error(model[3], model[2]), model])
ax1 = sns.distplot(
model[3], hist=False, color="r", label="Actual Value", ax=axes[cause - 1]
)
axes[cause - 1].set_title(name)
sns.distplot(model[2], hist=False, color="b", label="Fitted Value", ax=ax1)
The model is really bad at predictions for most causes of death, so let's narrow it down a bit.
Now, let's take the top five models ranked by mean-squared-error and see how they fare using some bar plots.
models.sort(key=lambda x: x[1])
fig, axes = plt.subplots(5, figsize=(30, 40))
for x in range(5):
name = models[x][0]
mse = models[x][1]
actual_vs_predicted = pd.DataFrame([models[x][2][2], models[x][2][3]]).T
axes[x].set_title(name)
actual_vs_predicted.columns = ["Predicted", "Actual"]
actual_vs_predicted.sort_values(by="Predicted", inplace=True)
actual_vs_predicted.plot(kind="bar", ax=axes[x])
fig, axes = plt.subplots(3,2, figsize=(30, 30))
for x in range(5):
name = models[x][0]
mse = models[x][1]
actual_vs_predicted = pd.DataFrame([models[x][2][2], models[x][2][3]]).T
axes.ravel()[x].set_title(name)
actual_vs_predicted.columns = ["Predicted", "Actual"]
actual_vs_predicted.sort_values(by="Predicted", inplace=True)
actual_vs_predicted.plot(kind="scatter", x='Predicted', y='Actual',ax=axes.ravel()[x])
The model for eye disease is clearly inaccurate. Diseases relating to blood has too few data points and is also too inaccurate. The only reason the model for infectious and parasitic diseases has low MSE is because few people die from infectious diseases in developed countries. That is clearly not a good model either.
Congenital malformations is directly correlated with CMR for obvious reasons, so it is not interesting. The only one left that is somewhat interesting is diseases of the circulatory system. Let's do some more analysis on that.
display(models[2])
model_data = models[2][2]
by_cause = for_mlm.query("Cause==9")
scale_factor = by_cause.DeathsPer100k.max() - by_cause.DeathsPer100k.min()
values = np.array([model_data[0], model_data[1][0], model_data[1][1]])
values *= scale_factor
def predict_point(row):
# display(row)
bmi, cmr = row.BMI, row.CMR
return values[0] + values[1]*cmr + values[2]*bmi
print(
"Equation for Number of Deaths/100k: y = "
+ str(values[0])
+ " + "
+ str(values[1])
+ "a + "
+ str(values[2])
+ "b"
)
Where a is CMR and b is BMI.
Let's try to use this model to predict a country's diseases of the circulatory system deaths per 100k as time progresses.
mses = []
for country in altogether.CountryCode.unique():
df = altogether[(altogether.CountryCode==country)&(altogether.Cause==cause)]
df['Predicted'] = df.apply(predict_point,axis=1)
mses.append([country, mean_squared_error(df.Predicted, df.DeathsPer100k)])
mses.sort(key=lambda x: x[1])
mses = pd.DataFrame(mses,columns=['CountryCode', 'MSE'])
with pd.option_context("display.max_rows", None):
display(mses)
fig, axes = plt.subplots(10,5,figsize=(30,100))
def predict_countries(countries, cause):
for x in range(len(countries)):
country = countries[x]
ax = axes.ravel()[x]
df = altogether[(altogether.CountryCode==country)&(altogether.Cause==cause)]
df['Predicted'] = df.apply(predict_point,axis=1)
sns.scatterplot(data=df, x='Year', y='DeathsPer100k',ax=ax)
sns.scatterplot(data=df, x='Year', y='Predicted',ax=ax)
ax.set_xticks(ax.get_xticks()[::5])
ax.set_title(df.iloc[0]['Country'])
ax.set_ylabel('DeathsPer100k')
ax.legend(['Actual','Predicted'])
predict_countries(mses.loc[:49,'CountryCode'],9)
Nice! This model appears to work for most developed and in-transition countries. A few countries look to be off by a simple translation parallel to the y-axis. This may be perhaps due to a genetic predisposition to die less from heart-related disorders or some other external factor. Countries can calibrate this model for their own usage with a few years of mortality, CMR and BMI data.
I will now show you in detail some curated graphs that display this model's effectiveness.
curated = ['LTU','USA','NZL','POL','PRT','CUB']
fig, axes = plt.subplots(2,3,figsize=(30,10))
predict_countries(curated,9)
Lithuania: The prediction is off by a simple translation, but I would like to draw attention to around the 1993 mark. When the actual figures unexpectedly (at least against the year) increased, the predicted values also increased. This shows that something happened in Lithuania that year which shifted deaths due to cardiovascular causes and BMI or CMR.
Poland: Same as Lithuania but at the 1996 mark. I wonder if it's due to the Soviet Union disintegrating.
United States: Pretty good prediction for one of the world's most important countries, starts deviating a bit towards the end, perhaps as the country reaches stages of late-development.
New Zealand: This model in general works well for developed countries, which is expected as the model is based off their data. The other option (use all data) would have way too much variability due to developing countries, and is much worse than this.
Cuba: As Cuba transitions towards being less of a developing country and in general becoming more stable, the actual figures become more in line with the predicted ones.
Portugal: Same as Cuba. This model is quite effective during the transition stage between developing and developed.
def predict_country(cause):
country = ""
while country != "STOP":
country = input("Enter alpha-3 country code: ").upper()
if country in altogether.CountryCode.values:
df = altogether[(altogether.CountryCode==country)&(altogether.Cause==cause)]
df['Predicted'] = df.apply(predict_point,axis=1)
display(df)
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=df, x='Year', y='DeathsPer100k')
sns.scatterplot(data=df, x='Year', y='Predicted',ax=ax)
ax.set_title(df.iloc[0]['Country'])
ax.set_ylabel('DeathsPer100k')
ax.legend(['Actual','Predicted'])
plt.show()
# uncomment this to choose any country you want to be displayed
# predict_country(9)
Countries with exceptionally low rates of death due to mental disorders should monitor the mental health of children more closely, and diagnose mental disorders early on. This would ensure that appropriate care is given to them but the truth is that in the end many of them will die later on life nonetheless due to their condition. Therefore, when a country's rate of death due to mental disorders is high, it means that it has succeeded in given it's disabled children a few extra years of life.
Countries can use their leading cause of death to judge their current state of development. If it is infectious diseases, then they are far behind the rest of the world and should work on sanitation. If it is illnesses of the circulatory system, then they are average and should implement measures such as advocating for a healthy lifestyle. If it is neoplasms, then they are very developed and there is not much they can do about it unless cancer can be cured.
Countries in the transition phase of their development can use the model in question 4 to roughly determine the number of deaths due to illnesses of the circulatory system. They can then compare the predicted value with the actual value to see if they have a disproportionately high or low number of deaths due to that cause at their current stage of development. If it is overly high, then the country should work on equipping its public spaces with facilities necessary (e.g. AEDs) to prevent deaths due to heart disease from occurring.
Russia should really advocate for a healthier diet among its citizens.